Introduction to Open Data Science - Course Project

Chapter 1: Start me up!

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 11 10:27:22 2023"

The text continues here.

1 Some thoughts about this course

e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?

As an R beginner, I feel this course is very friendly. Both the teacher's instructions and the book are clear enough for me to understand and move on easily. The learning curve looks smooth, and the instructions are also encouraging. All of this gives a beginner confidence. I have been impressed by how versatile and powerful R is, so I want to add it to my research toolbox, although my research does not currently involve it (it may in the future once I have the skill). In other words, learning R is an interest or hobby of mine at the moment; it is always good to learn something fresh in my spare time, and R is the latest example. I heard about this course from a colleague who regularly makes impressive graphs from statistical data. He became really skillful at operating R after a year of learning that started with this very course. His feedback on the course and on his learning outcomes encouraged me to get started with R here.

2 Reflection on the book and the Exercise 1

How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?

The book is clear, instructive, and encouraging for an R beginner. I enjoyed the parts on data wrangling and visualization, which are easily accessible and readable. My favorite feature of R is the instant feedback after executing a command, whether right or wrong; this is what a researcher is eager for but seldom gets in an academic career, even after considerable effort. However, I find the data analysis challenging, because interpreting the R output requires specialized knowledge of statistics. It can be easy to run the analysis functions but hard to understand the meaning of the results, so there is more to learn as I move on to data analysis in R. The book provides plenty of clear instructions to work through and practice, which I think is enough to get started with R; I just need to spend time on it.

4 Describe the work I have done this week and summarize my learning

Following the detailed instructions in “Start me up”, I set up the technical platforms needed in this course, including R, RStudio, and GitHub. Following the teacher’s instructions, I got to know how to push an updated script to GitHub. Then I started learning R with the book “R for Health Data Science” and the corresponding Exercise 1. This week I finished Chapters 1-5 of the book, which was a heavy and time-consuming task, but rewarding! In addition, I read the recommended Part 1 of the textbook “MABS4IODS” and watched the two webinars on RStudio and GitHub.


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 11 10:27:22 2023"

Data analysis assignment

1 Read the data

Read the students2014 data into R and explore the structure and the dimensions of the data.

students2014 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Description: The data set consists of 166 observations of 7 variables. It contains the gender, age, attitude, and exam points of 166 students, together with their responses to questions related to deep, surface, and strategic learning.

2 Graphical overview with a plot matrix

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# For making a nicer plot
my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_point(..., alpha=1, pch = 21, color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_density(..., alpha=0.7,  color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_discrete <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_bar(..., alpha=1,  color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_upper_combo<- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_boxplot(...,alpha=1, color ="black") +
    geom_jitter(...,alpha = 0.5, width = 0.25, size = 1, pch = 21, color = "black")+
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

# Plot
p <- ggpairs(students2014, 
             mapping = aes(col = gender, fill = gender, alpha = 0.7), 
             lower = list(
               continuous = my_lower,
               combo = wrap("facethist", bins = 20, color = "black")
               ),
             diag = list(
               continuous = my_diag,
               discrete =  my_discrete
             ),
             upper = list(
               combo = my_upper_combo,
               continuous = wrap("cor", color = "black")
             )
             )+
  scale_fill_manual(values = c("lightblue", "wheat"))+
  scale_color_manual(values = c("lightblue", "wheat"))
  

# Print the plot
p

# Summaries of the variables in the data
summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Description: Most of the numeric variables in the data are approximately normally distributed, except for age, which is concentrated around the early twenties. Females are about twice as frequent as males. Among the variables, attitude towards statistics is most strongly correlated with exam points (r = 0.437).

3 Fit a regression model

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

# Choose attitude, stra, and surf as the three explanatory variables because they have the highest (absolute) correlations with the target variable, exam points, according to the plot matrix.
model1 <- lm(points ~ attitude + stra + surf, data = students2014)

# Print out the summary of the first model
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# Remove the unfit variables and new fitted regression model
model2 <- lm(points ~ attitude, data = students2014)

# Print out the summary of the fitted model
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Description: “Attitude” was the only significant predictor in the model. The other two variables entered, “stra” and “surf”, had p-values of 0.12 and 0.47, respectively. Therefore, the model was re-fitted with “attitude” alone, producing a final model that explains 19.06% (R-squared = 0.1906) of the variance of the response (exam points).

4 Interpret the fitted model

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.

summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Description: The fitted model shows that attitude is a meaningful predictor of exam points. Specifically:

- a. With a one-unit increase in attitude, the expected exam score increases by about 3.5 points.

- b. When attitude is zero, the predicted exam score is 11.637; in other words, a student on whom attitude had no influence at all would be expected to score 11.637 points in the exam.

- c. The model explains only a fairly small proportion (19%) of the variance in exam points. In other words, attitude alone is not a good enough predictor of exam points, even though its influence should be recognized; random error and other factors must also play a role.
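As a quick sanity check (a sketch, assuming the students2014 data frame loaded above): with a single explanatory variable, the multiple R-squared is simply the squared Pearson correlation between the predictor and the response, so the r = 0.437 seen in the plot matrix squares to roughly the 0.19 reported in the model summary.

```r
# With one predictor, multiple R-squared equals the squared correlation;
# 0.437^2 is approximately 0.19, matching the model summary above.
cor(students2014$attitude, students2014$points)^2
```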

5 Diagnostic plots

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))

Description:

   a. The Residuals vs Fitted plot shows that the points are randomly scattered around the dotted line at y = 0, and the fitted (red) line is roughly horizontal without distinct patterns or trends, indicating a linear relationship. The linearity assumption of linear regression is therefore supported.
   
   b. The QQ plot shows that most of the points lie on the dashed straight line, except at the lower and upper ends, where some points deviate from it, indicating that the distribution might be slightly skewed. Nevertheless, the normality assumption can be considered approximately met, since with a large sample size this assumption is almost never perfectly satisfied.
   
   c. The Residuals vs Leverage plot shows no outlying observations with large Cook's distances. If the plot did show such observations, they should be considered for removal from the data, because they would have high influence on the linear model.
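If influential observations did need to be flagged, one common approach (a sketch only; the 4/n cutoff is a rule of thumb, not part of the assignment) is to list the observations whose Cook's distance exceeds a threshold:

```r
# Flag observations with Cook's distance above the rule-of-thumb cutoff 4/n
# (assumes model2 and students2014 from the chunks above).
cd <- cooks.distance(model2)
which(cd > 4 / nrow(students2014))
```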

Chapter 3: Logistic Regression

date()
## [1] "Mon Dec 11 10:27:25 2023"

1 Read the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
# read the data
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
# print out the names of the variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

This data set contains information on 370 students’ backgrounds, achievement, and alcohol consumption in two Portuguese schools. There are 35 variables, covering student grades, demographic, social, and school-related features, as well as performance in Mathematics (mat) and Portuguese language (por). The data were collected from school reports and questionnaires.

2 Hypothesis

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.

Based on my everyday observations and some research reports about teenagers’ alcohol consumption, I chose quality of family relationships (“famrel”), number of school absences (“absences”), weekly study time (“studytime”), and frequency of going out with friends (“goout”) as candidate predictors of alcohol consumption. Hypothesis: a student belongs to the high-alcohol-consumption group if he or she (1) has a low-quality family relationship; (2) is absent from school more often; (3) spends less time studying weekly; and (4) goes out with friends more often.

3 The distribution of the chosen variables and their relationships with alcohol consumption

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.

3.1 The distribution of the chosen variables

library(ggplot2)
library(tidyr)

alc |> 
  select(absences, studytime, famrel, goout) |> 
  pivot_longer(everything(), 
               names_to = "variable", 
               values_to = "value") |> 
  ggplot(aes(x = value))+
  facet_wrap(~variable, scales = "free")+
  geom_bar()+
  labs(title = "Distribution of the variables of interest",
       x  = "Values of each variable",
       y = "Frequency")

  1. The distribution of “absences” is skewed to the right with a very long tail. This indicates that most students have full or almost full class attendance, while a small number of students miss quite a number of classes.
  2. The other three variables come from Likert-type items, with values ranging from 1 to 5. For family relationship quality, most students report good or very good relationships, with around two thirds at the high end of the scale.
  3. Most students appear quite social and spend a lot of time going out with friends.
  4. For study time, most students spend 2 to 5 hours a week studying, and very few spend more than 10 hours a week.

3.2 The relationships between interesting variables and alcohol consumption

# The relationship between students' absences of class and alcohol consumption (box plot for numerical~categorical variables)
p1 <- alc |> 
  ggplot(aes(x = high_use, y = absences)) +
  geom_boxplot() +
  geom_jitter(width=0.25, alpha=0.5)+
  labs(x = "Alcohol consumption", 
       y = "Frequency of class absences",
       title = 
         "Frequency of class absences and alcohol consumption")+
  scale_x_discrete(labels = c("FALSE" = "Non-high-user", 
                              "TRUE" = "high-user"))
p1 

# The relationship between quality of family relation and alcohol consumption (bar plot for categorical variables)
p2 <- alc |> 
  ggplot(aes(x = factor(famrel), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Quality of family relationships", 
       y = "Proportion of alcohol high-users",
       title = 
         "Quality of family relationships and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p2

# The relationship between going out with friends and alcohol consumption (bar plot for categorical variables)
p3 <- alc |> 
  ggplot(aes(x = factor(goout), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Going out with friends", 
       y = "Proportion of alcohol high-users",
       title = 
         "Going out with friends and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p3

# The relationship between weekly study time and alcohol consumption (bar plot for categorical variables)
p4 <- alc |> 
  ggplot(aes(x = factor(studytime), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Weekly study time", 
       y = "Proportion of alcohol high-users",
       title = 
         "Weekly study time and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p4

# combining four plots together
library(patchwork)
p1 + p2 + p3 + p4

Four plots explore the relationships between the four chosen variables and alcohol consumption. (1) The box plot shows that the distribution and median (Q1, Q3) of class absences differ greatly between alcohol high-users and non-high-users: students with high alcohol use tend to have more class absences, as hypothesized. (2) The first bar plot shows a different result from the hypothesis regarding quality of family relationships: the students with the worst family relationships are not the highest-consumption group; rather, the highest alcohol consumption is found among students with bad or middling family relationships. (3) The second bar plot shows that the more frequently students go out with friends, the higher their alcohol consumption, as hypothesized. (4) The third bar plot shows that the more time students spend studying each week, the lower their alcohol consumption, as hypothesized.
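The same relationships can also be checked numerically with a cross-tabulation, as the task suggests. A minimal sketch for the goout variable (assuming the alc data frame loaded above):

```r
# Cross-tabulate going out with friends against high alcohol use and
# show row proportions, i.e. the share of high-users at each goout level.
tab <- table(goout = alc$goout, high_use = alc$high_use)
prop.table(tab, margin = 1)
```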

4 Logistic regression

Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. Hint: If your model includes factor variables, see for example the RHDS book or the first answer of this stack exchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).

# Fit the model base on the original hypothesis
model <- glm(high_use ~ absences + famrel + goout + studytime, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences + famrel + goout + studytime, 
##     family = "binomial", data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.15221    0.71444  -1.613 0.106798    
## absences     0.06655    0.02192   3.036 0.002394 ** 
## famrel      -0.35898    0.13796  -2.602 0.009266 ** 
## goout        0.75990    0.12143   6.258 3.91e-10 ***
## studytime   -0.57194    0.16963  -3.372 0.000747 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.36  on 365  degrees of freedom
## AIC: 380.36
## 
## Number of Fisher Scoring iterations: 4
# Coefficients and their confidence intervals
# Note: these are on the log-odds scale; exponentiate them to obtain odds ratios
OR <- coef(model)
CI <- confint(model)
## Waiting for profiling to be done...
ORCI <- cbind(OR, CI)
print(ORCI, digits = 1)
##                OR 2.5 % 97.5 %
## (Intercept) -1.15 -2.57   0.24
## absences     0.07  0.02   0.11
## famrel      -0.36 -0.63  -0.09
## goout        0.76  0.53   1.01
## studytime   -0.57 -0.92  -0.25

All four hypothesized predictors are statistically significant in the model (p < 0.01), indicating that all of them predict alcohol consumption. Note that the printed estimates and confidence intervals are on the log-odds scale; exponentiating them gives odds ratios. (1) Absences: each additional class absence multiplies the odds of being an alcohol high-user by about exp(0.067) ≈ 1.07, i.e. roughly 7% higher odds (95% CI of the coefficient: 0.02 to 0.11). (2) Quality of family relationships: each one-unit increase is associated with exp(-0.359) ≈ 0.70 times the odds of being a high-user, i.e. about 30% lower odds (95% CI: -0.63 to -0.09). (3) Going out with friends: each additional level of going out is associated with exp(0.760) ≈ 2.14 times the odds, i.e. the odds roughly double (95% CI: 0.53 to 1.01). (4) Weekly study time: each additional level of study time is associated with exp(-0.572) ≈ 0.56 times the odds, i.e. about 44% lower odds (95% CI: -0.92 to -0.25). These findings are consistent with the stated hypotheses.
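To present the coefficients as odds ratios, as the task asks, the log-odds estimates and their confidence intervals can be exponentiated. A minimal sketch, assuming the model object fitted above:

```r
# Exponentiate the log-odds coefficients and their CIs to get odds ratios;
# e.g. exp(0.76) for goout is roughly 2.1, so the odds of being a high-user
# roughly double per additional level of going out.
odds_ratios <- cbind(OR = exp(coef(model)), exp(confint(model)))
print(odds_ratios, digits = 2)
```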

5 Predicted probabilities and cross tabulation

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

# Explore the predictive power of the model
probabilities <- predict(model, type = "response")
# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, famrel, goout, studytime, high_use, probability, prediction) %>% tail(10)
##     absences famrel goout studytime high_use probability prediction
## 361        3      4     3         2    FALSE  0.22224385      FALSE
## 362        0      4     2         1    FALSE  0.16242879      FALSE
## 363        7      5     3         1     TRUE  0.31573116      FALSE
## 364        1      5     3         3    FALSE  0.08975198      FALSE
## 365        6      4     3         1    FALSE  0.38200795      FALSE
## 366        2      5     2         3    FALSE  0.04697548      FALSE
## 367        2      4     4         2    FALSE  0.36371174      FALSE
## 368        3      1     1         2    FALSE  0.15505370      FALSE
## 369        4      2     5         1     TRUE  0.83529411       TRUE
## 370        2      4     1         1     TRUE  0.09388808      FALSE
# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% 
  addmargins
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   233   26 259
##    TRUE     63   48 111
##    Sum     296   74 370
# Display a graphic visualizing actual high_use and predictions
p5 <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
  geom_point()
p5

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
  prop.table()%>%
  addmargins()%>%
  print(digits = 2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.63 0.07 0.70
##    TRUE   0.17 0.13 0.30
##    Sum    0.80 0.20 1.00

Among the 259 participants who are not alcohol high-users, the model correctly classifies 233 (90%). Among the 111 participants who are high-users, it correctly classifies only 48 (43%). In all, 89 of the 370 predictions (24%) are inaccurate.
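For comparison with a simple guessing strategy (a sketch, assuming the alc data frame with the prediction column added above): always predicting the majority class, i.e. that no student is a high-user, would misclassify all 111 high-users.

```r
# Training error of the fitted model vs. always guessing FALSE:
# the constant guess errs on all high-users (111/370, about 0.30),
# which is worse than the model's 89/370, about 0.24.
model_error <- mean(alc$high_use != alc$prediction)
guess_error <- mean(alc$high_use)
c(model = model_error, guess = guess_error)
```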

6 10-fold cross-validation (bonus)

Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model?

# Define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
set.seed(1)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
# average number of wrong predictions in the cross validation
delta.percent <- paste0((cv$delta[1]|> round(4))*100, "%") 

The training error rate is 24%, outperforming the model in the Exercise Set, which had about 26% error. According to the 10-fold cross-validation, the model has an average test error rate of 23.51%, slightly lower than the training error.

7 The relationship between prediction error and number of predictors (super bonus)

Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.


Chapter 4: Clustering and classification

date()
## [1] "Mon Dec 11 10:27:27 2023"

1 Load the Boston data

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The data set describes housing values in the suburbs of Boston. There are 506 observations of 14 variables (numeric and integer). The 14 variables are:

1. crim: per capita crime rate by town.
2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus: proportion of non-retail business acres per town.
4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. nox: nitrogen oxides concentration (parts per 10 million).
6. rm: average number of rooms per dwelling.
7. age: proportion of owner-occupied units built prior to 1940.
8. dis: weighted mean of distances to five Boston employment centres.
9. rad: index of accessibility to radial highways.
10. tax: full-value property-tax rate per $10,000.
11. ptratio: pupil-teacher ratio by town.
12. black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
13. lstat: lower status of the population (percent).
14. medv: median value of owner-occupied homes in $1000s.

2 Graphical overview

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

# Show summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# First make a distribution plot for 14 variables
library(ggplot2)
library(tidyr)

# Convert the data to long format for easier plotting
long_boston <- gather(Boston)

# Plotting
p1 <- ggplot(long_boston, aes(x = value)) +
  geom_density(fill = "skyblue", color = "black") +
  facet_wrap(~key, scales = "free") +
  theme_minimal() +
  labs(title = "Overview of Boston dataset")

# Print the plot
p1

# Then make a correlations plot to look at the correlations among the 14 variables in the data
library(corrplot)
## corrplot 0.92 loaded
# Calculate the correlation matrix and round it
cor_matrix <- cor(Boston) |> 
  round(digits = 2)

# Print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")

From the distribution plots, most of the variables are skewed to the right or to the left; only rm is relatively normally distributed. From the correlation matrix, the strongest correlations are between rad and tax (positive), dis and age (negative), dis and nox (negative), dis and indus (negative), and lstat and medv (negative).
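To back up this reading, the strongest pairs can also be extracted programmatically from the `cor_matrix` computed above (the 0.7 threshold is an arbitrary choice):

```r
# list variable pairs with |r| > 0.7 (excluding the diagonal), each pair once
strong <- which(abs(cor_matrix) > 0.7 & abs(cor_matrix) < 1, arr.ind = TRUE)
strong <- strong[strong[, 1] < strong[, 2], , drop = FALSE]  # upper triangle only
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           r    = cor_matrix[strong])
```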

3 Standardize the dataset; create a categorical variable of the crime rate; divide the dataset to train and test sets

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

3.1 Standardize the dataset

# Center and standardize variables
boston_scaled <- scale(Boston) |> 
  as.data.frame()

# Summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling the data, the mean of every variable becomes zero. Most values of the variables now range roughly from -4 to 4, the one exception being crim (crime rate), whose maximum reaches almost 10 due to its strong right skew.
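As a quick sanity check of what scale() does, the same standardization can be computed by hand: subtract the column mean and divide by the column standard deviation (shown here for crim; `manual_crim` is just an illustrative name):

```r
# manual standardization of one column, compared against scale()'s result
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim)  # should be TRUE
```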

3.2 Create a categorical variable of the crime rate

# Summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# Create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low","med_high", "high"))

# Look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

3.3 Divide the dataset to train and test sets

library(dplyr)

# Number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

# Create test set 
test <- boston_scaled[-ind,]

4 Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

# Fit the linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# Print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2450495 0.2400990 0.2599010 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.94768689 -0.9154164 -0.15765625 -0.8664804  0.49324129 -0.8330934
## med_low  -0.06128399 -0.2905159 -0.03371693 -0.5543909 -0.09719511 -0.3794097
## med_high -0.37054365  0.2063740  0.05238023  0.3768264  0.15156095  0.3976410
## high     -0.48724019  1.0149946 -0.04735191  1.0267822 -0.39946717  0.8122921
##                 dis        rad        tax    ptratio       black        lstat
## low       0.8750206 -0.6897369 -0.7780503 -0.4386754  0.38100748 -0.767876258
## med_low   0.3826266 -0.5456857 -0.4654718 -0.1017024  0.31146285 -0.162813217
## med_high -0.4097682 -0.3756701 -0.2603363 -0.2604135  0.09270391  0.006875968
## high     -0.8535887  1.6596029  1.5294129  0.8057784 -0.76867030  0.834420910
##                medv
## low       0.5582392
## med_low   0.0121384
## med_high  0.2007779
## high     -0.6528397
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11885002  0.66122613 -0.98747678
## indus   -0.04130912 -0.23640126  0.23839579
## chas    -0.07358997 -0.01543395  0.21302462
## nox      0.33296411 -0.76745164 -1.35199275
## rm      -0.08676976 -0.07017084 -0.15558456
## age      0.29552614 -0.15968266 -0.24988799
## dis     -0.11193174 -0.11699745  0.17008638
## rad      3.12327648  1.09445683 -0.22148103
## tax      0.10112350 -0.22764844  0.78165323
## ptratio  0.14306963  0.10445909 -0.33126409
## black   -0.15592597 -0.01390650  0.09298486
## lstat    0.17766584 -0.24785436  0.32723953
## medv     0.17126083 -0.36432554 -0.26921116
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9537 0.0335 0.0128
# Draw the LDA (bi)plot
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crime)

# Plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

A biplot based on LD1 and LD2 was generated. In it, the clusters of the low, med_low, and med_high classes separate poorly, with heavy overlap between the three. The high class separates well, although it still overlaps notably with med_high. Based on the arrows, the variable rad contributes most to separating the high class. The contributions of the variables to the other clusters are hard to judge because of the heavy overlap.

5 Predict LDA

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Predict classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       8        0    0
##   med_low    5      16        6    0
##   med_high   0      15       14    0
##   high       0       0        1   21

The cross tabulated results show that the low, med_low, and high classes are predicted correctly more often than not, while fewer than half of the med_high observations (14 of 29) are classified correctly. Overall, the predictive performance of our linear discriminant analysis is moderate rather than satisfactory.
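The overall accuracy can be read off the cross tabulation as the proportion of observations on the diagonal; for the table shown above this is 67/102 ≈ 0.66, though the exact value varies between runs because the train/test split was not seeded.

```r
# overall accuracy from the confusion table
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
# equivalently: mean(correct_classes == lda.pred$class)
```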

6 Distance measure

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

6.1 Reload the Boston dataset and standardize the dataset

data("Boston")
boston_scaled <- scale(Boston) |> 
  as.data.frame()

6.2 Calculate the distances between the observations

# Euclidean distances matrix
dist_eu <- dist(boston_scaled)

# Look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

6.3 Determine K-means for the optimal number of clusters

set.seed(123)
k_max <- 10

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The optimal number of clusters is where the total WCSS drops radically. Here that happens when moving from one to two clusters, so two clusters seem optimal.

6.4 Visualize the clusters and interpret the results

# K-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# Plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[1:7], col = km$cluster)

pairs(boston_scaled[8:14], col = km$cluster)

Most of the variables are influential linear separators for the clusters, except rad, ptratio, and tax.

7 Super-Bonus: 3D plot

Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p3 <- plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        color = train$crime, #Set the color to be the crime classes of the train set. 
        type= 'scatter3d', 
        mode='markers',
        size = 2)
p3
# Draw another 3D plot where the color is defined by the clusters of the k-means
# Get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 2) 

p4 <- plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type= 'scatter3d', 
        mode='markers', 
        color = factor(train.km$cluster), #color defined by clusters of the k-means
        size = 2)
p4
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

The LDA was trained on a quantile-based categorization of the crime rate, which has four classes, whereas k = 2 was adopted for the k-means clustering based on the total within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the categories are labeled in the legend. K-means clustering is an unsupervised method, so before inspecting the clusters closely I do not know anything about the real-world meaning of the 2 clusters it identified. However, viewing the two 3D plots together, it is interesting to find that cluster 2 from k-means overlaps nicely with the high category from the LDA, while cluster 1 roughly covers the other three categories.
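The visual overlap between the two groupings could also be quantified with a simple cross tabulation on the train set (the exact counts depend on the random split and the k-means initialization, so this is a sketch rather than a recorded result):

```r
# how the 2 k-means clusters distribute over the 4 LDA target classes
table(crime = train$crime, cluster = train.km$cluster)
```

If the impression from the 3D plots holds, one cluster column should be dominated by the high class and the other should collect most of the remaining three classes.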


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Mon Dec 11 10:27:31 2023"

1 Overview of the data

Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

1.1 Read the data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ plotly::select() masks MASS::select(), dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.2 Move the country names to rownames

human_1 <- column_to_rownames(human, "Country")

summary(human_1)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

1.3 Graphical overview

library(GGally)

# To make a nicer plot, create some helper functions
# Define a function that gives more control over ggpairs. This function produces a point plot with a fitted line.
my.fun.smooth <- function(data,    # my function needs 3 arguments
                          mapping,
                          method = "lm"){
  ggplot(data = data, # data is passed from ggpairs' arguments
         mapping = mapping)+ # aes is passed from ggpairs' arguments
           geom_point(size = 0.3,  # draw points
                      color = "black")+
           geom_smooth(method = method,  # fit a linear regression
                       size = 0.3, 
                       color = "red")+
           theme(panel.grid.major = element_blank(), # get rid of the grids
                 panel.grid.minor = element_blank(),
                 panel.background = element_rect(fill = "wheat", #adjust background
                                                 color = "black"))
} 

# Define a function that gives more control over ggpairs. This function produces a density plot.
my.fun.density <- function(data, mapping, ...) { # comments roughly as above

    ggplot(data = data, mapping = mapping) +
       geom_histogram(aes(y=..density..),
                      color = "black", 
                      fill = "white")+
       geom_density(fill = "#FF6666", alpha = 0.25) +
       theme(panel.grid.major = element_blank(), 
             panel.grid.minor = element_blank(),
             panel.background = element_rect(fill = "lightblue",
                                             color = "black"))
} 

ggpairs(human_1, #data
        progress = FALSE,
        lower = 
          list(continuous = my.fun.smooth), # lower half show points with fitted line
        diag =
          list(continuous = my.fun.density), # diagonal grids show density plots
        title = "Relationships between variables") + # title
  theme (plot.title = element_text(size = 22,  # adjust title visuals
                                   face = "bold"))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## (the `stat_bin()` and `geom_smooth()` messages above are repeated for each panel of the plot matrix)

Only the expected years of education (Edu.Exp) is approximately normally distributed; the rest of the variables are skewed in one direction or the other. Most of the variables correlate strongly with several others, with the exceptions of Parli.F and Labo.FM.

2 Principal component analysis (PCA) and biplot

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.

2.1 PCA

# Perform principal component analysis on the raw (non-standardized) human data
pca_human <- prcomp(human_1)
print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

PC1 explains 99.99% of the variability in the data set; all the other components together contribute less than 0.1%.
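This is a direct consequence of the variables' raw scales: unscaled PCA follows the directions of largest variance, and one variable dwarfs the rest. A quick look at the column variances makes this visible:

```r
# column variances of the raw data: GNI's variance is orders of magnitude
# larger than the others, so unscaled PC1 is essentially GNI
sort(apply(human_1, 2, var), decreasing = TRUE)
```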

2.2 Draw a biplot

biplot_1 <- biplot(pca_human, choices = 1:2,
       cex = c(0.8, 1),
       col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

In the biplot, red text marks the variables and grey text the observations (countries). GNI lies far from the origin (0,0) along the x axis (PC1), indicating its strong contribution to PC1. Most of the countries cluster tightly around the origin, which indicates that they are not well represented on the factor map.

3 Repeat PCA with standardized data and biplot

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to.

3.1 Standardize the variables in the human data

human_std <- scale(human_1)
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

3.2 Perform PCA again

pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.071 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.536 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.536 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

After standardizing, PC1 explains 53.6% of the variability in the data set and PC2 a further 16.2%. Together, PC1 and PC2 explain about 70% of the variability.
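These proportions come straight from the component standard deviations that prcomp() reports. A minimal, self-contained sketch of the calculation, using the built-in iris data as a stand-in since the human data is not reloaded here:

```r
# Recover the "Proportion of Variance" row of summary(prcomp(...)) by hand:
# each PC's variance is its squared standard deviation, divided by the total.
pca <- prcomp(scale(iris[, 1:4]))
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(prop, 3)  # proportions sum to 1 and decrease from PC1 onwards
```

The cumulative proportions in the summary are simply `cumsum(prop)`.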

3.3 Biplot

biplot(pca_human_std, choices = 1:2,
       cex = c(0.8, 1),
       col = c("grey40", "deeppink2"))

After standardizing the data set, the row and column points are scattered more evenly across the plot and all the variables are visualized more reasonably. The countries are now well represented on the factor map.

4 Personal interpretations of two PCAs and biplots

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

Based on the findings above, it is not hard to conclude that PCA on the standardized data set produces results better suited for analysis. A likely explanation is that variables on different scales make comparisons between pairs of features difficult. PCA is computed from the covariance matrix. Unlike correlation, which is dimensionless, covariance is expressed in units obtained by multiplying the units of the two variables. When the data set is not scaled, the variables are therefore not easily comparable with each other, since they all have their own value ranges. As a consequence, each variable loads almost exclusively on one component, because it can hardly find another variable with a comparable value range. This interpretation is supported by the fact that the only two variables with smaller loadings are Edu2.FM and Labo.FM, both of which happen to share a similar value range from 0 to 1. Covariance also gives some variables extremely high leverage in the data set. Looking back at the summary of the raw human data, GNI has a scale tremendously larger than the other variables. This leads to large covariances between GNI and every other variable, and in turn to GNI's over-contribution to the factor solution. This misrepresentation of the data results in a poor-quality factor map: most of the countries cluster tightly together, indicating that the PCA has not produced acceptable dissimilarity among rows, and the over-contribution of GNI leaves only one variable (GNI) visible at a distance while the others overlap heavily around the center.
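The scale dependence of covariance (and the scale invariance of correlation) is easy to demonstrate with a toy example; the two vectors below are illustrative and not taken from the human data:

```r
# Covariance is expressed in the product of the two variables' units,
# so rescaling one variable rescales the covariance by the same factor.
# Correlation is dimensionless and unaffected by rescaling.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cov(x, y)         # 1.5
cov(x * 1000, y)  # 1500: a GNI-like scale inflates the covariance 1000-fold
cor(x, y)         # unchanged by rescaling
cor(x * 1000, y)  # identical to cor(x, y)
```

This is exactly why `scale()` (or `prcomp(..., scale. = TRUE)`) puts every variable on an equal footing before PCA.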

5 Tea data

The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were asked (4 questions).

5.1 Load the tea dataset and convert its character variables to factors

# Load the data
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Convert character variables to factors
# According to the structure above, the character variables are already stored as factors, so no conversion is needed.

5.2 Multiple Correspondence Analysis (MCA)

Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.

# Select columns
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# Visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# MCA
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# Visualize MCA
plot(mca, invisible = c("ind"), graph.type = "classic", habillage = "quali")

5.3 Interpretation

Multiple Correspondence Analysis (MCA) is a multivariate statistical technique used for analyzing the relationships between categorical variables. MCA results are often presented in terms of principal components.

Eigenvalues in the summary represent the amount of variance captured by each dimension; higher eigenvalues indicate more important dimensions. The two most important dimensions explain 15.238% and 14.232% of the variance, respectively.

The plot is a variable biplot showing how the categories of the six variables disperse along the first two principal dimensions. Categories close to each other are positively associated, while those far apart are negatively associated, so the relationships between categories can be read from their proximity. The position of a category along a dimension indicates its contribution to that dimension. For example, sugar and No.sugar contribute little to either dimension, while tea shop and unpackaged contribute strongly to the first dimension.
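For comparison, the MASS package (shipped with R) also provides a simple MCA implementation. The sketch below runs it on made-up factor data, not the tea data, just to show the shape of the output:

```r
library(MASS)

# Toy questionnaire data: every column must be a factor for mca().
df <- data.frame(
  drink = factor(c("tea", "coffee", "tea", "tea", "coffee", "tea")),
  sugar = factor(c("yes", "no", "no", "yes", "yes", "no")),
  when  = factor(c("am", "pm", "am", "pm", "am", "am"))
)

fit <- mca(df, nf = 2)  # keep the first two dimensions
fit$d^2                 # eigenvalues (inertia) of the retained dimensions
fit$rs                  # row (individual) coordinates
fit$cs                  # column (category) coordinates
```

FactoMineR::MCA, used above, exposes the same quantities (plus contributions and cos2) through `mca$eig`, `mca$ind$coord`, and `mca$var$coord`.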


Chapter 6: Analysis of longitudinal data

date()
## [1] "Mon Dec 11 10:27:36 2023"

1. Implement the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART I but using the RATS data (from Chapter 9 and Meet and Repeat: PART II).

2. Implement the analyses of Chapter 9 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART II, but using the BPRS data (from Chapter 8 and Meet and Repeat: PART I).

1 Load the wrangled data

library(tidyverse)
# Load the data from the assignment of data wrangling, including BPRS data and RATS data
source("meet_and_repeat.r")
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr [1:360] "week0" "week0" "week0" "week0" ...
##  $ bprs     : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ time  : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...

2 Analysis: PART I (RATS data)

# Check the content of the data
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
summary(RATS)
##        ID     Group      WD1             WD8             WD15      
##  1      : 1   1:8   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  2      : 1   2:4   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  3      : 1   3:4   Median :340.0   Median :345.0   Median :347.5  
##  4      : 1         Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  5      : 1         3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  6      : 1         Max.   :555.0   Max.   :560.0   Max.   :565.0  
##  (Other):10                                                        
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##                                                                 
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0  
## 
names(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "time"
str(RATSL)
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ time  : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
##        ID      Group       WD                Weight           time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

Description of the RATS data: there are 16 rats in this nutrition study, identified by IDs 1 to 16 and divided into three treatment groups. Group 1 includes eight rats, while Groups 2 and 3 have four rats each. Each of the 16 rats underwent 11 repeated measurements, resulting in 176 (16 × 11) observations. The three groups were put on different diets, and each rat's body weight (grams) was recorded approximately weekly, except in week seven when two recordings were taken, over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.

2.1 Individuals on the plot

# Plot the weights of all the observation over the time in a plot, differentiating three treatment groups.
library(ggplot2)
RATSL |> 
  ggplot(aes(x = time, y = Weight, group = ID, color = Group))+
  geom_line()+
  geom_point()+
  labs(title = "Change of weight in three treatment groups",
       x = "Time (days)",
       y = "Weight (grams)")+
  theme(plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_rect(fill = "white",
                                        color = "black"),
        panel.grid.major = element_line(color = "grey", linewidth = 0.2),
        panel.grid.minor = element_line(color = "grey", linewidth = 0.2),
        strip.background = element_rect(color = "black",#adjust the strips aes
                                        fill = "steelblue"),
        strip.text = element_text(size = 10,
                                  color = "white"),
        legend.position = "none")+
  facet_wrap(~Group)

Findings: 1. All individuals gained weight over time. 2. The rats in Group 1 are far lighter than the rats in Groups 2 and 3, which have similar weight levels.

2.2 Standardization

To reduce the influence of the starting weight when tracking changes over time, we can standardize the observed weights within each time point, i.e., \[ \text{standardised}(x) = \frac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)} \]

# Standardise the variable weight
RATSL <- RATSL |> 
  group_by(time) |> 
  mutate(stdwgt =  (Weight - mean(Weight))/sd(Weight))  |> 
  ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## $ stdwgt <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0.…
# Plot again with the standardised weight
RATSL |> 
  ggplot(aes(x = time, y = stdwgt, group = ID, color = Group))+
  geom_line()+
  geom_point()+
  labs(title = "Change of standardised weight in three treatment groups",
       x = "Time (days)",
       y = "Standardized weight")+
  theme(plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_rect(fill = "white",
                                        color = "black"),
        panel.grid.major = element_line(color = "grey", linewidth = 0.2),
        panel.grid.minor = element_line(color = "grey", linewidth = 0.2),
        strip.background = element_rect(color = "black",#adjust the strips aes
                                        fill = "steelblue"),
        strip.text = element_text(size =10, 
                                  color = "white"),
        legend.position = "none")+
  facet_wrap(~Group)

Findings: after standardization, the weight changes of the individual rats over time appear much less pronounced.

2.3 Summary graphs with standard error

In addition to displaying the individual profiles, another approach is to show the mean profile for each treatment group, along with some indication of the variation of the observations at each time point, here the standard error of the mean: \[ se = \frac{\operatorname{sd}(x)}{\sqrt{n}} \]

# Summary data with mean and standard error of rats by group and time
rats.group <- RATSL |> 
  group_by(Group, time) |> 
  summarise(mean = mean(Weight),se = sd(Weight)/sqrt(n())) |> 
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(rats.group)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
# Plot the mean profiles of the groups over the time
library(ggplot2)

rats.group |> 
  ggplot(aes(x = time, 
             y = mean, 
             shape = Group,
             color = Group))+
  geom_line()+
  geom_point(size=3)+
  theme(legend.position = c(0.9,0.5),
        panel.background = element_rect(fill = "white",
                                        color = "black"),
        panel.grid = element_line(color = "grey",
                                  linewidth = 0.1),
        axis.text = element_text(size = 10),
        axis.title = element_text (size = 13),
        plot.title = element_text(size = 15,
                                  face = "bold")) + 
  labs(title = "Change of average weight of three groups over the time",
       x = "Time(days)",
       y = "Average(grams)")+
  scale_color_manual(values = c("wheat4", "steelblue", "darkred"))

# Add error bar (mean±2se) to the plot
## Create an object that saves dodge position so that point and line dodge simultaneously (for preventing overlap)
dodgeposition <- position_dodge(width = 0.3)

rats.group |> 
  ggplot(aes(x = time, 
             y = mean, 
             shape = Group,
             color = Group))+
  geom_line(position = dodgeposition) + #dodge to avoid overlap
  geom_point(size=3, position = dodgeposition) +#dodge to avoid overlap
  geom_errorbar(aes(ymin=mean-2*se, ymax=mean+2*se), 
                width=0.5, #set width of error bar
                position =dodgeposition) +#dodge to avoid overlap
  theme(legend.position = c(0.9,0.5),
        panel.background = element_rect(fill = "white",
                                        color = "black"),
        panel.grid = element_line(color = "grey",
                                  linewidth = 0.1),
        axis.text = element_text(size = 10),
        axis.title = element_text (size = 13),
        plot.title = element_text(size = 15,
                                  face = "bold")) + 
  labs(title = "Change of weight statistics (mean±2se) of three groups over time",
       x = "Time(days)",
       y = "Average weight±2se(grams)")+
  scale_color_manual(values = c("wheat4", "steelblue", "darkred"))

Findings: 1. The average weight of each group increased with a mild slope over time. 2. The average weights rank Group 3 > Group 2 > Group 1. 3. The error bars (no overlap between Group 1 and Groups 2 and 3) show that the rats differed greatly in weight from the very outset and that the group differences remained significant throughout.

2.4 Find and remove the outlier

# Create a summary data by groups and subject with mean as the summary variable (ignoring baseline WD1, "RATSL_nobase")
library(dplyr)
RATSL_nobase <- RATSL |> 
  filter(time > 1) |> 
  group_by(Group, ID) |> 
  summarise(mean=mean(Weight) ) |> 
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATSL_nobase)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Draw a boxplot of the mean versus groups
library(ggplot2)
RATSL_nobase |> 
  ggplot(aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
  scale_y_continuous(name = "mean(weight(g)), time 8-64")

# Find and filter the outliers

## Create a function that detect outliers
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

## Mutate a new variable "outlier" to tag outlier weight
ratsl_outl <- RATSL_nobase|>  
  group_by(Group) |>  
  mutate(outlier = ifelse(is_outlier(mean), ID, as.factor(NA))) #create outlier label

## Find the outlier in the boxplot
ratsl_outl |> 
  ggplot(aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
  scale_y_continuous(name = "mean(weight(g)), time 8-64")+
  geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3)

## Filter the outlier ("RATSL_nobase1") and adjust the ggplot code the draw the plot again with the new data
RATSL_nobase1 <- ratsl_outl |> 
  filter(is.na(outlier))

ggplot(RATSL_nobase1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "red") +
  scale_y_continuous(name = "mean(weight(g)), time 8-64")

After removing the outliers, the IQRs are much narrower than before.

2.5 Differences across groups: ANOVA and linear model

Apply ANOVA to assess differences among the three treatment groups, using the data without the outliers.

anova <- aov(mean ~ Group, data = RATSL_nobase1)
summary(anova)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 176917   88458    2836 1.69e-14 ***
## Residuals   10    312      31                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value < 0.05 indicates that the mean weights differ across the three groups. To find out which pair(s) of groups differ significantly, apply Tukey's HSD post-hoc test for pairwise comparisons between the groups.

pairT <- TukeyHSD(anova)
print(pairT)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mean ~ Group, data = RATSL_nobase1)
## 
## $Group
##          diff       lwr       upr p adj
## 2-1 183.64286 173.07885 194.20686     0
## 3-1 269.50952 258.94552 280.07353     0
## 3-2  85.86667  73.36717  98.36617     0

There are statistically significant differences (p < 0.0001) between each pair of groups. However, we do not know whether these differences are due only to the treatments, or also to differences at baseline. A linear regression with the baseline as a covariate can reveal the influence of the baseline on the mean differences across groups.

# Add the baseline from the original data as a new variable to the summary data
RATSL_baseline <- inner_join(RATSL_nobase1, RATS, by = c("Group","ID")) |> 
  select(Group, ID, mean, WD1) |> 
  mutate(baseline = WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ Group + baseline, data = RATSL_baseline)
summary(fit)
## 
## Call:
## lm(formula = mean ~ Group + baseline, data = RATSL_baseline)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6341 -2.8915  0.1102  2.0096  7.8989 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 221.3094    28.3120   7.817 2.66e-05 ***
## Group2      152.7218    18.7452   8.147 1.91e-05 ***
## Group3      219.6183    29.9107   7.342 4.36e-05 ***
## baseline      0.1866     0.1111   1.680    0.127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.136 on 9 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9982 
## F-statistic:  2236 on 3 and 9 DF,  p-value: 3.048e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## Group      2 176917   88458 3353.2062 1.181e-13 ***
## baseline   1     74      74    2.8219    0.1273    
## Residuals  9    237      26                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The baseline has no significant effect on the mean weight. After adjusting for the baseline, the mean differences across groups remain significant (p < 0.05).

3 Analysis: PART II (BPRS data)

# Check the content of the data
names(BPRS)
##  [1] "treatment" "subject"   "week0"     "week1"     "week2"     "week3"    
##  [7] "week4"     "week5"     "week6"     "week7"     "week8"
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
summary(BPRS)
##  treatment    subject       week0           week1           week2     
##  1:20      1      : 2   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  2:20      2      : 2   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##            3      : 2   Median :46.00   Median :41.00   Median :38.0  
##            4      : 2   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##            5      : 2   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##            6      : 2   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##            (Other):28                                                 
##      week3           week4           week5           week6           week7     
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00   Min.   :18.0  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75   1st Qu.:23.0  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50   Median :30.0  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23   Mean   :32.2  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00   3rd Qu.:38.0  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00   Max.   :62.0  
##                                                                                
##      week8      
##  Min.   :20.00  
##  1st Qu.:22.75  
##  Median :28.00  
##  Mean   :31.43  
##  3rd Qu.:35.25  
##  Max.   :75.00  
## 
names(BPRSL)
## [1] "treatment" "subject"   "weeks"     "bprs"      "week"
str(BPRSL)
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr [1:360] "week0" "week0" "week0" "week0" ...
##  $ bprs     : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252

Description of the BPRS data: 40 males were divided into two treatment groups of 20 subjects each (subjects 1 to 20 in each group). Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia. This analysis applies linear mixed effects models to study the effects of the two treatments on the BPRS rating over time, taking into account the variability across individuals within the groups.

3.1 Plot the data

To begin, plot the individual BPRS profiles in both treatment groups over the weeks, for now ignoring the longitudinal nature (repeated-measures structure) of the data.

library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color = subject)) +
  geom_line()+
  facet_wrap(~treatment)+
  theme(legend.position = "none",
        panel.grid = element_line(color = "grey", linewidth = 0.1),
        panel.background = element_rect(color = "black",
                                        fill = "white"),
        strip.background = element_rect(color = "black",
                                        fill = "steelblue"),
        strip.text = element_text(color = "white",
                                  face = "bold",
                                  size = 10),
        axis.title  = element_text(size = 12),
        axis.text = element_text(size = 10))+
  labs(title = "Two treatment effects on BPRS over week by individuals",
       x = "Time (weeks)",
       y = "BPRS rating")

Findings: 1. The BPRS ratings of the participants decreased in both treatment groups over time. 2. The trajectories and the starting points differed greatly across individuals. 3. It is not straightforward to see which treatment is better.

3.2 Multiple linear regression

Continuing to ignore the repeated-measures structure of the data, fit a multiple linear regression model with the BPRS rating as the response and time (weeks) and treatment as explanatory variables.

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
confint(BPRS_reg)
##                 2.5 %    97.5 %
## (Intercept) 43.765472 49.142306
## week        -2.766799 -1.774035
## treatment2  -1.991083  3.135527

From the p-values, time (week) is a significant predictor of the BPRS rating, but treatment has no significant influence on the rating. For each additional week, the rating decreased on average by 2.27 points (95% CI −2.77 to −1.77).

3.3 The Random Intercept Model

The previous model assumes independence of the repeated rating measures, which is highly unlikely here. For a more formal analysis of the BPRS data, we first fit a random intercept model with the same two explanatory variables, time (weeks) and treatment. A random intercept model lets the linear regression fits differ in intercept across subjects, reflecting that subjects start from different baseline ratings; this is the random intercept for each individual.

However, in the BPRSL data the subjects in both treatment groups are numbered 1 to 20, so the same subject number refers to two different individuals receiving the two treatments. This would cause a problem in the mixed-effects modelling, since subject is included in the model, so we first need to renumber the subjects.

library(dplyr)
BPRSL_new <- BPRSL |> 
  mutate(subject = as.numeric(subject)) |>  
  mutate(subject = ifelse(treatment ==2, subject + 20, subject))
glimpse(BPRSL_new)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Now fit the random intercept model.

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL_new, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL_new
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000
confint(BPRS_ref)
## Computing profile confidence intervals ...
##                 2.5 %    97.5 %
## .sig01       7.918218 12.645375
## .sigma       6.828332  7.973573
## (Intercept) 41.744598 51.163179
## week        -2.565919 -1.974914
## treatment2  -5.885196  7.029641

The fixed effects part of the model summary shows the same coefficient estimates as the multiple linear regression in 3.2, so we focus on the random effects part. On average (see the Std.Dev. of subject), the BPRS rating varies by about 9.87 points (95% CI 7.92 to 12.65) as we move from one participant to another. In fact, the influence of individual differences is even larger than the effect of time, the significant predictor in the fixed part of the model (coefficient −2.27, 95% CI −2.57 to −1.97).
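The variance components above can also be summarized as an intraclass correlation coefficient (ICC), i.e. the share of the total variance that comes from between-subject differences. A minimal sketch using the variances printed in the summary (not part of the original analysis; with the fitted model the same components are available via `VarCorr(BPRS_ref)`):

```r
# Variance components taken from the random intercept model summary above
var_subject  <- 97.39   # between-subject (random intercept) variance
var_residual <- 54.23   # within-subject (residual) variance

# ICC: proportion of total variance attributable to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)
## [1] 0.64
```

An ICC of about 0.64 means roughly two thirds of the variation in the ratings comes from differences between participants, which supports modelling the subject as a random effect.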

3.4 Random Intercept and Random Slope Model

Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ not only in intercept but also in slope. This makes it possible to account for individual differences in the BPRS rating profiles as well as the effects of time and treatment.

BPRS_refboth <- lmer(bprs ~  week + treatment + (treatment | subject), 
                  data = BPRSL_new, 
                  REML = FALSE)
summary(BPRS_refboth)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (treatment | subject)
##    Data: BPRSL_new
## 
##      AIC      BIC   logLik deviance df.resid 
##   2584.8   2612.0  -1285.4   2570.8      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3066 -0.6047 -0.0635  0.4402  3.1691 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  subject  (Intercept) 64.43    8.027        
##           treatment2  57.38    7.575    0.07
##  Residual             54.23    7.364        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9708  23.571
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.305       
## treatment2 -0.556  0.000

Again we focus on the random effects part. On average, the BPRS rating varies by about 8.03 points (see the Std.Dev. of subject) as we move from one participant to another, lower than in the random intercept model. The standard deviation of the treatment slope is 7.58, much larger than the (non-significant) fixed effect of treatment. This indicates that the effect of a different treatment can vary greatly between individuals, an important finding that sheds some light on why the two treatments do not show significantly different average effects.

In short, individual differences have a large effect on the BPRS rating, even larger than the effects of time and treatment, and they also strongly affect how individuals respond to the different treatments.
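Whether the random treatment slope actually improves on the plain random intercept model could be checked with a likelihood-ratio test, in the same way as the model comparison in section 3.5. A sketch, assuming `BPRS_ref` and `BPRS_refboth` have been fitted as above (both with `REML = FALSE`, so the deviances are comparable):

```r
# Likelihood-ratio test between nested models:
# random intercept only vs random intercept + random slope
anova(BPRS_ref, BPRS_refboth)
```

A small p-value would favour keeping the random slope; a large one would suggest the simpler random intercept model is sufficient.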

3.5 Random Intercept and Random Slope Model with interaction

As a final step, fit a random intercept and random slope model that also allows for a week × treatment interaction.

BPRS_interact <- lmer(bprs ~  week * treatment + (treatment | subject), 
                  data = BPRSL_new, 
                  REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
summary(BPRS_interact)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (treatment | subject)
##    Data: BPRSL_new
## 
##      AIC      BIC   logLik deviance df.resid 
##   2581.0   2612.1  -1282.5   2565.0      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2801 -0.6048 -0.0810  0.4414  3.3943 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.53    8.033         
##           treatment2  71.91    8.480    -0.04
##  Residual             53.27    7.298         
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.0573  23.275
## week             -2.6283     0.2107 -12.475
## treatment2       -2.2911     3.4296  -0.668
## week:treatment2   0.7158     0.2980   2.402
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.410              
## treatment2  -0.600  0.246       
## wek:trtmnt2  0.290 -0.707 -0.348
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
# Perform an ANOVA test on the two models
anova(BPRS_refboth, BPRS_interact)
## Data: BPRSL_new
## Models:
## BPRS_refboth: bprs ~ week + treatment + (treatment | subject)
## BPRS_interact: bprs ~ week * treatment + (treatment | subject)
##               npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_refboth     7 2584.8 2612.0 -1285.4   2570.8                       
## BPRS_interact    8 2581.0 2612.1 -1282.5   2565.0 5.7203  1    0.01677 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pay attention to the likelihood-ratio test chi-squared value and the corresponding p-value: the lower the deviance (and AIC) relative to the comparison model, the better the fit. The test shows that the interaction between week and treatment is significant (p = 0.017), so the model with the interaction fits the data better.
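As a final visual check, the fitted trajectories from the interaction model could be plotted next to the observed ones. A sketch, assuming `BPRS_interact` and `BPRSL_new` are still in the workspace:

```r
library(dplyr)
library(ggplot2)

# Append the fitted values from the interaction model to the long data
BPRSL_fit <- BPRSL_new |> 
  mutate(fitted = fitted(BPRS_interact))

# Observed (solid) vs fitted (dashed) BPRS profiles by treatment group
ggplot(BPRSL_fit, aes(x = week, group = subject, color = factor(subject))) +
  geom_line(aes(y = bprs)) +
  geom_line(aes(y = fitted), linetype = "dashed") +
  facet_wrap(~treatment) +
  theme(legend.position = "none") +
  labs(x = "Time (weeks)", y = "BPRS rating")
```

If the model captures the data well, the dashed fitted lines should follow the general downward trend of the observed profiles.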